Abstract

Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical
guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications
where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search
work even when the reference is poor?

We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but
additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee.
Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to
develop structured contextual bandits, a partial information structured prediction setting with many potential
applications.


1. Introduction 

In structured prediction problems, a learner makes joint predictions over a set of interdependent output variables and
observes a joint loss. For example, in a parsing task, the output is a parse tree over a sentence. Achieving optimal performance
commonly requires the prediction of each output variable to depend on neighboring variables. One approach to structured
prediction is learning to search (l2s) (Collins & Roark, 2004; Daume III & Marcu, 2005; Daume III et al., 2009; Ross
et al., 2011; Doppa et al., 2014; Ross & Bagnell, 2014), which solves the problem by:

1. converting structured prediction into a search problem with specified search space and actions; 

2. defining structured features over each state to capture the interdependency between output variables; 

3. constructing a reference policy based on training data; 

4. learning a policy that imitates the reference policy. 

Empirically, l2s approaches have been shown to be competitive with other structured prediction approaches both in
accuracy and running time (see e.g. Daume III et al. (2014)). Theoretically, existing l2s algorithms guarantee that if the
learning step performs well, then the learned policy is almost as good as the reference policy, implicitly assuming that the
reference policy attains good performance. Good reference policies are typically derived using labels in the training data,
such as assigning each word to its correct POS tag. However, when the reference policy is suboptimal, which can arise for
reasons such as computational constraints, nothing can be said for existing approaches.

Figure 1. An illustration of the search space of a sequential tagging example that assigns a part-of-speech tag sequence to the
sentence “John saw Mary.” Each state represents a partial labeling. The start state is b = [_ _ _] and the set of end states is
E = {[N V N], [N V V], ...}. Each end state is associated with a loss. A policy chooses an action at each state in the search
space to specify the next state.

This problem is most obviously manifest in a “structured contextual bandit”^1 setting. For example, one might want to
predict how the landing page of a high-profile website should be displayed; this involves many interdependent predictions:
items to show, position and size of those items, font, color, layout, etc. It may be plausible to derive a quality signal for the
displayed page based on user feedback, and we may have access to a reasonable reference policy (namely the existing
rule-based system that renders the current web page). But applying l2s techniques results in nonsense: learning something
almost as good as the existing policy is useless, as we can just keep using the current system and obtain that guarantee.
Unlike the full feedback settings, label information is not even available during learning to define a substantially better
reference. The goal of learning here is to improve upon the current system, which is most likely far from optimal. This
naturally leads to the question: is learning to search useless when the reference policy is poor?

This is the core question of the paper, which we address first with a new l2s algorithm, LOLS (Locally Optimal Learning
to Search) in Section 2. LOLS operates in an online fashion and achieves a bound on a convex combination of
regret-to-reference and regret-to-own-one-step-deviations. The first part ensures that good reference policies can be leveraged
effectively; the second part ensures that even if the reference policy is very suboptimal, the learned policy is approximately
“locally optimal” in a sense made formal in Section 3.

LOLS operates according to a general schematic that encompasses many past l2s algorithms (see Section 2), including
Searn (Daume III et al., 2009), DAgger (Ross et al., 2011) and AggreVaTe (Ross & Bagnell, 2014). A secondary
contribution of this paper is a theoretical analysis of both good and bad ways of instantiating this schematic under a variety of
conditions, including: whether the reference policy is optimal or not, and whether the reference policy is in the hypothesis
class or not. We find that, while past algorithms achieve good regret guarantees when the reference policy is optimal, they
can fail rather dramatically when it is not. LOLS, on the other hand, has superior performance to other l2s algorithms
when the reference policy performs poorly but local hill-climbing in policy space is effective. In Section 5, we empirically
confirm that LOLS can significantly outperform the reference policy in practice on real-world datasets.

In Section 4 we extend LOLS to address the structured contextual bandit setting, giving a natural modification to the 
algorithm as well as the corresponding regret analysis. 

The algorithm LOLS, the new kind of regret guarantee it satisfies, the modifications for the structured contextual bandit
setting, and all experiments are new here.

2. Learning to Search 

A structured prediction problem consists of an input space $\mathcal{X}$, an output space $\mathcal{Y}$, a fixed but unknown
distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and a non-negative loss function
$\ell(y^*, \hat{y}) \to \mathbb{R}^{\geq 0}$ which measures the distance between the true ($y^*$) and predicted
($\hat{y}$) outputs. The goal of structured learning is to use $N$ samples $(x_i, y_i)_{i=1}^{N}$ to learn a mapping
$f : \mathcal{X} \to \mathcal{Y}$ that minimizes the expected structured loss under $\mathcal{D}$.

In the learning to search framework, an input $x \in \mathcal{X}$ induces a search space, consisting of an initial state $b$
(which we will take to also encode $x$), a set of end states and a transition function that takes state/action pairs $s, a$ and
deterministically transitions to a new state $s'$. For each end state $e$, there is a corresponding structured output $y_e$ and
for convenience we define the loss $\ell(e) = \ell(y^*, y_e)$ where $y^*$ will be clear from context. We further define a
feature generating function $\Phi$ that maps states to feature vectors in $\mathbb{R}^d$. The features express both the
input $x$ and previous predictions (actions). Fig. 1 shows an example search space^2.

^1 The key difference from (1) contextual bandits is that the action space is exponentially large (in the length of trajectories in the
search space); and from (2) reinforcement learning is that a baseline reference policy exists before learning starts.

Figure 2. An example search space. The exploration begins at the start state s and chooses the middle among three actions by the
roll-in policy twice. Grey nodes are not explored. At state r the learning algorithm considers the chosen action (middle) and both one-step
deviations from that action (top and bottom). Each of these deviations is completed using the roll-out policy until an end state is reached,
at which point the loss is collected. Here, we learn that deviating to the top action (instead of middle) at state r decreases the loss by 0.2.

An agent follows a policy $\pi \in \Pi$, which chooses an action $a \in A(s)$ at each non-terminal state $s$. An action specifies
the next state from $s$. We consider policies that only access state $s$ through its feature vector $\Phi(s)$, meaning that
$\pi(s)$ is a mapping from $\mathbb{R}^d$ to the set of actions $A(s)$. A trajectory is a complete sequence of state/action
pairs from the starting state $b$ to an end state $e$. Trajectories can be generated by repeatedly executing a policy $\pi$ in
the search space. Without loss of generality, we assume the lengths of trajectories are fixed and equal to $T$. The expected
loss $J(\pi)$ of a policy is the expected loss of the end state of the trajectory $e \sim \pi$, where $e \in E$ is an end state
reached by following the policy^3. Throughout, expectations are taken with respect to draws of $(x, y)$ from the training
distribution, as well as any internal randomness in the learning algorithm.
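The definitions above can be made concrete with a small sketch. It assumes a toy tagging search space for the sentence "John saw Mary" with a hypothetical two-tag set and gold labeling (N, V, N); the names `run_policy`, `end_state` and `hamming_loss` are illustrative and not part of the original work.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]            # a partial labeling, e.g. ("N", "V")
Policy = Callable[[State], str]    # maps a state to an action (the next tag)


def run_policy(policy: Policy, length: int) -> List[Tuple[State, str]]:
    """Generate a trajectory of (state, action) pairs of fixed length T."""
    state: State = ()              # start state b: nothing labeled yet
    trajectory = []
    for _ in range(length):
        action = policy(state)
        trajectory.append((state, action))
        state = state + (action,)  # deterministic transition to s'
    return trajectory


def end_state(trajectory: List[Tuple[State, str]]) -> State:
    """Return the end state reached after the last action of a trajectory."""
    last_state, last_action = trajectory[-1]
    return last_state + (last_action,)


def hamming_loss(gold: Sequence[str], predicted: Sequence[str]) -> int:
    """Number of positions where the predicted tag differs from the gold tag."""
    return sum(g != p for g, p in zip(gold, predicted))


gold = ("N", "V", "N")                       # assumed gold tags for "John saw Mary"
always_noun: Policy = lambda state: "N"      # a poor policy
reference: Policy = lambda state: gold[len(state)]  # copies the gold tag

traj = run_policy(always_noun, len(gold))
loss_always_noun = hamming_loss(gold, end_state(traj))
loss_reference = hamming_loss(gold, end_state(run_policy(reference, len(gold))))
```

With a single training example and deterministic policies, the loss of the end state is exactly the expected loss of the policy on that example.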

An optimal policy chooses the action leading to the minimal expected loss at each state. For losses decomposable over the 
states in a trajectory, generating an optimal policy is trivial given y* (e.g., the sequence tagging example in (Daume III 
et al., 2009)). In general, finding the optimal action at states not in the optimal trajectory can be tricky (e.g., (Goldberg & 
Nivre, 2013; Goldberg et al., 2014)). 

Finally, like most other l2s algorithms, LOLS assumes access to a cost-sensitive classification algorithm. A cost-sensitive
classifier predicts a label $\hat{y}$ given an example $x$, and receives a loss $c_x(\hat{y})$, where $c_x$ is a vector containing
the cost for each possible label. In order to perform online updates, we assume access to a no-regret online cost-sensitive
learner, which we formally define below.

^2 Doppa et al. (2014) discuss several approaches for defining a search space. The theoretical properties of our approach do not depend
on which search space definition is used.

^3 Some imitation learning literature (e.g., (Ross et al., 2011; He et al., 2012)) defines the loss of a policy as an accumulation of
the costs of states and actions in the trajectory generated by the policy. For simplicity, we define the loss only based on the end state.
However, our theorems can be generalized.

Algorithm 1 Locally Optimal Learning to Search (LOLS)

Require: Dataset $\{x_i, y_i\}_{i=1}^{N}$ drawn from $\mathcal{D}$ and $\beta \geq 0$: a mixture parameter for roll-out.
1: Initialize a policy $\hat{\pi}_0$.
2: for all $i \in \{1, 2, \ldots, N\}$ (loop over each instance) do
3:   Generate a reference policy $\pi^{\text{ref}}$ based on $y_i$.
4:   Initialize $\Gamma = \emptyset$.
5:   for all $t \in \{0, 1, 2, \ldots, T - 1\}$ do
6:     Roll-in by executing $\pi_i^{\text{in}} = \hat{\pi}_i$ for $t$ rounds and reach $s_t$.
7:     for all $a \in A(s_t)$ do
8:       Let $\pi_i^{\text{out}} = \pi^{\text{ref}}$ with probability $\beta$, otherwise $\hat{\pi}_i$.
9:       Evaluate cost $c_{i,t}(a)$ by rolling-out with $\pi_i^{\text{out}}$ for $T - t - 1$ steps.
10:     end for
11:     Generate a feature vector $\Phi(x_i, s_t)$.
12:     Set $\Gamma = \Gamma \cup \{\langle c_{i,t}, \Phi(x_i, s_t) \rangle\}$.
13:   end for
14:   $\hat{\pi}_{i+1} \leftarrow \text{Train}(\hat{\pi}_i, \Gamma)$ (Update).
15: end for
16: Return the average policy across $\hat{\pi}_0, \hat{\pi}_1, \ldots, \hat{\pi}_N$.

Definition 1. Given a hypothesis class $\mathcal{H} : \mathcal{X} \to [K]$, the regret of an online cost-sensitive classification
algorithm which produces hypotheses $h_1, \ldots, h_M$ on cost-sensitive example sequence
$\{(x_1, c_1), \ldots, (x_M, c_M)\}$ is

$$\text{Regret}_M^{CS} = \sum_{m=1}^{M} c_m(h_m(x_m)) - \min_{h \in \mathcal{H}} \sum_{m=1}^{M} c_m(h(x_m)). \tag{1}$$

An algorithm is no-regret if $\text{Regret}_M^{CS} = o(M)$.

Such no-regret guarantees can be obtained, for instance, by applying the SECOC technique (Langford & Beygelzimer, 
2005) on top of any importance weighted binary classification algorithm that operates in an online fashion, examples being 
the perceptron algorithm or online ridge regression. 
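To illustrate the regret of Definition 1, the following sketch computes it for a finite hypothesis class by comparing the cost incurred by the sequence of played hypotheses against the best fixed hypothesis in hindsight. The hypothesis class, the examples and the function name `cost_sensitive_regret` are hypothetical choices made only for this illustration.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]          # maps an input to a label index
Example = Tuple[float, Sequence[float]]      # (input x, cost vector c)


def cost_sensitive_regret(
    played: Sequence[Hypothesis],
    hypothesis_class: Sequence[Hypothesis],
    examples: Sequence[Example],
) -> float:
    """Cumulative cost of the played hypotheses minus that of the best fixed one."""
    incurred = sum(c[h(x)] for h, (x, c) in zip(played, examples))
    best_fixed = min(
        sum(c[h(x)] for x, c in examples) for h in hypothesis_class
    )
    return incurred - best_fixed


# A tiny, made-up hypothesis class over inputs in [0, 1] with two labels.
always_0: Hypothesis = lambda x: 0
always_1: Hypothesis = lambda x: 1
threshold: Hypothesis = lambda x: int(x > 0.5)
H: List[Hypothesis] = [always_0, always_1, threshold]

examples: List[Example] = [
    (0.2, [0.0, 1.0]),
    (0.9, [1.0, 0.0]),
    (0.7, [1.0, 0.0]),
    (0.1, [0.0, 1.0]),
]

# An online learner that errs on the first two rounds, then plays `threshold`.
played = [always_1, always_0, threshold, threshold]
regret = cost_sensitive_regret(played, H, examples)
```

Here the best fixed hypothesis is `threshold` with total cost 0, so the regret equals the cost of the two early mistakes. A no-regret learner is one for which this quantity grows sublinearly in the number of rounds.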

LOLS (see Algorithm 1) learns a policy $\hat{\pi} \in \Pi$ to approximately minimize $J(\pi)$,^4 assuming access to a
reference policy $\pi^{\text{ref}}$ (which may or may not be optimal). The algorithm proceeds in an online fashion generating a
sequence of learned policies $\hat{\pi}_0, \hat{\pi}_1, \hat{\pi}_2, \ldots$. At round $i$, a structured sample $(x_i, y_i)$ is
observed, and the configuration of a search space is generated along with the reference policy $\pi^{\text{ref}}$. Based on
$(x_i, y_i)$, LOLS constructs $T$ cost-sensitive multiclass examples using a roll-in policy $\pi_i^{\text{in}}$ and a roll-out
policy $\pi_i^{\text{out}}$. The roll-in policy is used to generate an initial trajectory and the roll-out policy is used to derive
the expected loss. More specifically, for each decision point $t \in [0, T)$, LOLS executes $\pi_i^{\text{in}}$ for $t$ rounds
reaching a state $s_t \sim \pi_i^{\text{in}}$. Then, a cost-sensitive multiclass example is generated using the features
$\Phi(s_t)$. Classes in the multiclass example correspond to available actions in state $s_t$. The cost $c(a)$ assigned to
action $a$ is the difference in loss between taking action $a$ and the best action:

$$c(a) = \ell(e(a)) - \min_{a'} \ell(e(a')), \tag{2}$$

where $e(a)$ is the end state reached with roll-out by $\pi_i^{\text{out}}$ after taking action $a$ in state $s_t$. LOLS collects the
$T$ examples from the different roll-out points and feeds the set of examples $\Gamma$ into an online cost-sensitive multiclass
learner, thereby updating the learned policy from $\hat{\pi}_i$ to $\hat{\pi}_{i+1}$. By default, we use the learned policy
$\hat{\pi}_i$ for roll-in and a mixture policy for roll-out. For each roll-out, the mixture policy either executes
$\pi^{\text{ref}}$ to an end state with probability $\beta$ or $\hat{\pi}_i$ with probability $1 - \beta$. LOLS converts into a
batch algorithm with a standard online-to-batch conversion where the final model $\bar{\pi}$ is generated by averaging
$\hat{\pi}_i$ across all rounds (i.e., picking one of $\hat{\pi}_1, \ldots, \hat{\pi}_N$ uniformly at random).

^4 We can parameterize the policy $\hat{\pi}$ using a weight vector $w \in \mathbb{R}^d$ such that a cost-sensitive classifier can be
used to choose an action based on the features at each state. We do not consider using different weight vectors at different states.
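The example-generation step for one training instance can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: states are tuples of tags, the loss is a Hamming loss against an assumed gold labeling, the training step is omitted, and the names `roll` and `lols_examples` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[str, ...]
Policy = Callable[[State], str]


def roll(policy: Policy, state: State, steps: int) -> State:
    """Execute `policy` for `steps` decisions starting from `state`."""
    for _ in range(steps):
        state = state + (policy(state),)
    return state


def lols_examples(
    learned: Policy,
    reference: Policy,
    actions: Sequence[str],
    loss: Callable[[State], float],
    T: int,
    beta: float,
    rng: random.Random,
) -> List[Tuple[State, Dict[str, float]]]:
    """Build one cost-sensitive example per time step for a single instance."""
    examples = []
    for t in range(T):
        s_t = roll(learned, (), t)                 # roll-in with the learned policy
        end_losses = {}
        for a in actions:                          # try every one-step deviation
            # Mixture roll-out: reference with probability beta, else learned.
            rollout = reference if rng.random() < beta else learned
            end_losses[a] = loss(roll(rollout, s_t + (a,), T - t - 1))
        best = min(end_losses.values())
        costs = {a: l - best for a, l in end_losses.items()}   # Equation (2)
        examples.append((s_t, costs))
    return examples


gold = ("N", "V", "N")                             # assumed gold tags
loss = lambda end: float(sum(g != p for g, p in zip(gold, end)))
learned: Policy = lambda s: "N"                    # current (poor) learned policy
reference: Policy = lambda s: gold[len(s)]

# beta = 1.0 makes every roll-out use the reference, so the result is deterministic.
examples = lols_examples(learned, reference, ("N", "V"), loss, 3, 1.0, random.Random(0))
```

In a full implementation the state would be replaced by its feature vector and the collected examples would be passed to the online cost-sensitive learner.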
roll-in \ roll-out    Reference           Mixture         Learned
Reference             Inconsistent        Inconsistent    Inconsistent
Learned               Not locally opt.    Good            RL

Table 1. Effect of different roll-in and roll-out policies. The strategies marked with “Inconsistent” might generate a learned policy with
a large structured regret, and the strategies marked with “Not locally opt.” could be much worse than its one-step deviation. The strategy
marked with “RL” reduces the structured learning problem to a reinforcement learning problem, which is much harder. The strategy
marked with “Good” is favored.
Figure 3. Counterexamples of $\pi_i^{\text{in}} = \pi^{\text{ref}}$ and $\pi_i^{\text{out}} = \pi^{\text{ref}}$. All three examples have 7 states. The loss of each end state is specified in the
figure. A policy chooses actions to traverse through the search space until it reaches an end state. Legal policies are bit-vectors, so that a
policy with a weight on a goes up in $s_1$ of Figure 3(a) while a weight on b sends it down. Since features uniquely identify actions of the
policy in this case, we just mark the edges with corresponding features for simplicity. The reference policy is bold-faced. In Figure 3(b),
the features are the same on either branch from $s_1$, so that the learned policy can do no better than pick randomly between the two. In
Figure 3(c), states $s_2$ and $s_3$ share the same feature set (i.e., $\Phi(s_2) = \Phi(s_3)$). Therefore, a policy chooses the same set of actions at
states $s_2$ and $s_3$. Please see text for details.


3. Theoretical Analysis 

In this section, we analyze LOLS and answer the questions raised in Section 1. Throughout this section we use $\bar{\pi}$ to denote
the average policy obtained by first choosing $n \in [1, N]$ uniformly at random and then acting according to $\hat{\pi}_n$. We begin
by discussing the choices of roll-in and roll-out policies. Table 1 summarizes the results of using different strategies for
roll-in and roll-out.

3.1. The Bad Choices 

An obvious bad choice is roll-in and roll-out with the learned policy, because the learner is blind to the reference policy. It 
reduces the structured learning problem to a reinforcement learning problem, which is much harder. To build intuition, we 
show two other bad cases. 

Roll-in with $\pi^{\text{ref}}$ is bad. Roll-in with a reference policy causes the state distribution to be unrealistically good. As a result,
the learned policy never learns to correct for previous mistakes, performing poorly when testing. A related discussion can
be found in Theorem 2.1 of (Ross & Bagnell, 2010). We show a theorem below.

Theorem 1. For $\pi_i^{\text{in}} = \pi^{\text{ref}}$, there is a distribution $\mathcal{D}$ over $(x, y)$ such that the induced cost-sensitive regret
$\text{Regret}_M^{CS} = o(M)$ but $J(\bar{\pi}) - J(\pi^{\text{ref}}) = \Omega(1)$.

Proof. We demonstrate examples where the claim is true.

We start with the case where $\pi_i^{\text{out}} = \pi_i^{\text{in}} = \pi^{\text{ref}}$. In this case, suppose we have one structured example, whose search
space is defined as in Figure 3(a). From state $s_1$, there are two possible actions: a and b (we will use actions and features
interchangeably since features uniquely identify actions here); the (optimal) reference policy takes action a. From state $s_2$,
there are again two actions (c and d); the reference takes c. Finally, even though the reference policy would never visit $s_3$,
from that state it chooses action f. When rolling in with $\pi^{\text{ref}}$, the cost-sensitive examples are generated only at state $s_1$ (if
we take a one-step deviation on $s_1$) and $s_2$ but never at $s_3$ (since that would require two deviations, one at $s_1$ and one
at $s_3$). As a result, we can never learn how to make predictions at state $s_3$. Furthermore, under a roll-out with $\pi^{\text{ref}}$, both
actions from state $s_1$ lead to a loss of zero. The learner can therefore learn to take action c at state $s_2$ and b at state $s_1$, and
achieve zero cost-sensitive regret, thereby “thinking” it is doing a good job. Unfortunately, when this policy is actually run,
it performs as badly as possible (by taking action e half the time in $s_3$), which results in the large structured regret.

Next we consider the case where $\pi_i^{\text{out}}$ is either the learned policy or a mixture with $\pi^{\text{ref}}$. When applied to the example in
Figure 3(b), our feature representation is not expressive enough to differentiate between the two actions at state $s_1$, so the
learned policy can do no better than pick randomly between the top and bottom branches from this state. The algorithm
either rolls in with $\pi^{\text{ref}}$ on $s_1$ and generates a cost-sensitive example at $s_2$, or generates a cost-sensitive example on $s_1$ and
then completes a roll-out with $\pi_i^{\text{out}}$. Crucially, the algorithm still never generates a cost-sensitive example at the state $s_3$
(since it would have already taken a one-step deviation to reach $s_3$ and is constrained to do a roll-out from $s_3$). As a result,
if the learned policy were to choose the action e in $s_3$, it leads to a zero cost-sensitive regret but large structured regret. □

Despite these negative results, rolling in with the learned policy is robust to both the above failure modes. In Figure 3(a), if
the learned policy picks action b in state $s_1$, then we can roll in to the state $s_3$, then generate a cost-sensitive example and
learn that f is a better action than e. Similarly, we also observe a cost-sensitive example in $s_3$ in the example of Figure 3(b),
which clearly demonstrates the benefits of rolling in with the learned policy as opposed to $\pi^{\text{ref}}$.

Roll-out with $\pi^{\text{ref}}$ is bad if $\pi^{\text{ref}}$ is not optimal. When the reference policy is not optimal or the reference policy is not in
the hypothesis class, roll-out with $\pi^{\text{ref}}$ can make the learner blind to compounding errors. The following theorem holds. We
state this in terms of “local optimality”: a policy is locally optimal if changing any one decision it makes never improves
its performance.

Theorem 2. For $\pi_i^{\text{out}} = \pi^{\text{ref}}$, there is a distribution $\mathcal{D}$ over $(x, y)$ such that the induced cost-sensitive regret
$\text{Regret}_M^{CS} = o(M)$ but $\bar{\pi}$ has arbitrarily large structured regret to one-step deviations.

Proof. Suppose we have only one structured example, whose search space is defined as in Figure 3(c) and the reference
policy chooses a or c depending on the node. If we roll-out with $\pi^{\text{ref}}$, we observe expected losses $1$ and $1 + \epsilon$ for actions
a and b at state $s_1$, respectively. Therefore, the policy with zero cost-sensitive classification regret chooses actions a and d
depending on the node. However, a one-step deviation (a → b) does radically better and can be learned by instead rolling
out with a mixture policy. □

The above theorems show the bad cases and motivate a good l2s algorithm which generates a learned policy that competes 
with the reference policy and deviations from the learned policy. In the following section, we show that Algorithm 1 is such 
an algorithm. 


3.2. Regret Guarantees 

Let $Q^{\pi}(s_t, a)$ represent the expected loss of executing action $a$ at state $s_t$ and then executing policy $\pi$ until reaching an end
state. $T$ is the number of decisions required before reaching an end state. For notational simplicity, we use $Q^{\pi}(s_t, \pi')$ as a
shorthand for $Q^{\pi}(s_t, \pi'(s_t))$, where $\pi'(s_t)$ is the action that $\pi'$ takes at state $s_t$. Finally, we use $d_{\pi}^{t}$ to denote the distribution
over states at time $t$ when acting according to the policy $\pi$. The expected loss of a policy is:

$$J(\pi) = \mathbb{E}_{s \sim d_{\pi}^{t}} \left[ Q^{\pi}(s, \pi) \right], \tag{3}$$

for any $t \in [0, T]$. In words, this is the expected cost of rolling in with $\pi$ up to some time $t$, taking $\pi$’s action at time $t$ and
then completing the roll-out with $\pi$.
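As a sanity check of this identity, the sketch below evaluates it on a toy deterministic search space with a single example, where all expectations are trivial. The binary action set, the loss function and the helper names `complete`, `q_value` and `expected_loss` are assumptions made for illustration only.

```python
from typing import Callable, Tuple

State = Tuple[int, ...]
Policy = Callable[[State], int]


def complete(policy: Policy, state: State, T: int) -> State:
    """Follow `policy` from `state` until an end state of length T is reached."""
    while len(state) < T:
        state = state + (policy(state),)
    return state


def q_value(policy: Policy, state: State, action: int,
            loss: Callable[[State], float], T: int) -> float:
    """Loss of taking `action` at `state` and then following `policy` to the end."""
    return loss(complete(policy, state + (action,), T))


def expected_loss(policy: Policy, loss: Callable[[State], float], T: int) -> float:
    """J(policy) for a deterministic policy on a single example."""
    return loss(complete(policy, (), T))


T = 3
loss = lambda end: sum(end) / T          # hypothetical loss: fraction of 1-actions
policy: Policy = lambda s: len(s) % 2    # takes actions 0, 1, 0

J = expected_loss(policy, loss, T)

# Roll in with the policy to each time t and evaluate Q at the policy's own action.
q_along = []
state: State = ()
for t in range(T):
    q_along.append(q_value(policy, state, policy(state), loss, T))
    state = state + (policy(state),)

# A one-step deviation at the first state, for comparison.
q_deviation = q_value(policy, (), 1, loss, T)
```

Every entry of `q_along` equals `J`, whichever time step is used, while the deviation at the first state yields a different value.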

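Our main regret guarantee for Algorithm 1 shows that LOLS minimizes a combination of regret to the reference policy
and regret to its own one-step deviations. In order to concisely present the result, we present an additional definition which
captures the regret of our approach:

$$\delta_N = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{E}_{s \sim d_{\hat{\pi}_i}^{t}} \left[ Q^{\pi_i^{\text{out}}}(s, \hat{\pi}_i) - \left( \beta \min_{a} Q^{\pi^{\text{ref}}}(s, a) + (1 - \beta) \min_{a} Q^{\hat{\pi}_i}(s, a) \right) \right], \tag{4}$$

where $\pi_i^{\text{out}} = \beta \pi^{\text{ref}} + (1 - \beta) \hat{\pi}_i$ is the mixture policy used to roll-out in Algorithm 1. With these definitions in place, we
can now state our main result for Algorithm 1.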
Theorem 3. Let $\delta_N$ be as defined in Equation 4. The averaged policy $\bar{\pi}$ generated by running $N$ steps of Algorithm 1 with
a mixing parameter $\beta$ satisfies

$$\beta \left( J(\bar{\pi}) - J(\pi^{\text{ref}}) \right) + (1 - \beta) \sum_{t=1}^{T} \left( J(\bar{\pi}) - \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\bar{\pi}}^{t}} \left[ Q^{\bar{\pi}}(s, \pi) \right] \right) \leq T \delta_N.$$

It might appear that the LHS of the theorem combines one term which is constant with another scaling with $T$. We point
the reader to Lemma 1 in the appendix to see why the terms are comparable in magnitude. Note that the theorem does not
assume anything about the quality of the reference policy, and it might be arbitrarily suboptimal. Assuming that Algorithm 1
uses a no-regret cost-sensitive classification algorithm (recall Definition 1), the first term in the definition of $\delta_N$ converges
to

$$\ell^* = \min_{\pi \in \Pi} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{E}_{s \sim d_{\hat{\pi}_i}^{t}} \left[ Q^{\pi_i^{\text{out}}}(s, \pi) \right].$$

This observation is formalized in the next corollary.

Corollary 1. Suppose we use a no-regret cost-sensitive classifier in Algorithm 1. As $N \to \infty$, $\delta_N \to \delta_{\text{class}}$, where

$$\delta_{\text{class}} = \ell^* - \frac{1}{NT} \sum_{i,t} \mathbb{E}_{s \sim d_{\hat{\pi}_i}^{t}} \left[ \beta \min_{a} Q^{\pi^{\text{ref}}}(s, a) + (1 - \beta) \min_{a} Q^{\hat{\pi}_i}(s, a) \right].$$

When we have $\beta = 1$, so that LOLS becomes almost identical to AggreVaTe (Ross & Bagnell, 2014), $\delta_{\text{class}}$ arises
solely due to the policy class $\Pi$ being restricted. For other values of $\beta \in (0, 1)$, the asymptotic gap does not always vanish
even if the policy class is unrestricted, since $\ell^*$ amounts to obtaining $\min_a Q^{\pi_i^{\text{out}}}(s, a)$ in each state. This corresponds to
taking a minimum of an average rather than the average of the corresponding minimum values.
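A small numeric example makes the gap between the two quantities visible. It relies on the fact that the mixture roll-out follows the reference for the whole roll-out with probability β and the learned policy otherwise, so its Q-value is the β-weighted average of the two; the specific Q-values below are made up for illustration.

```python
# Hypothetical Q-values at a single state with two actions.
beta = 0.5
q_ref = {"a": 0.0, "b": 1.0}       # loss when rolling out with the reference
q_learned = {"a": 1.0, "b": 0.0}   # loss when rolling out with the learned policy

# Q-value under the mixture roll-out: the beta-weighted average of the two.
q_out = {a: beta * q_ref[a] + (1 - beta) * q_learned[a] for a in q_ref}

# What the best classifier can achieve: the minimum of the average.
min_of_average = min(q_out.values())

# The comparator in the definition of delta: the average of the minima.
average_of_min = beta * min(q_ref.values()) + (1 - beta) * min(q_learned.values())

gap = min_of_average - average_of_min
```

Both actions have the same averaged cost of 0.5, whereas each roll-out policy taken separately admits an action with cost 0, so the gap is 0.5 even with an unrestricted policy class.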

In order to avoid this asymptotic gap, it seems desirable to have regrets to reference policy and one-step deviations
controlled individually, which is equivalent to having the guarantee of Theorem 3 for all values of $\beta$ in $[0, 1]$ rather than a
specific one. As we show in the next section, guaranteeing a regret bound to one-step deviations when the reference
policy is arbitrarily bad is rather tricky and can take an exponentially long time. Understanding structures where this can be
done more tractably is an important question for future research. Nevertheless, the result of Theorem 3 has interesting
consequences in several settings, some of which we discuss next.

1. The second term on the left in the theorem is always non-negative by definition, so the conclusion of Theorem 3 is
at least as powerful as the existing regret guarantee to the reference policy when $\beta = 1$. Since the previous works in this
area (Daume III et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014) have only studied regret guarantees to the
reference policy, the quantity we’re studying is strictly more difficult.

2. The asymptotic regret incurred by using a mixture policy for roll-out might be larger than that using the reference
policy alone, when the reference policy is near-optimal. How the combination of these factors manifests in practice is
empirically evaluated in Section 5.

3. When the reference policy is optimal, the first term is non-negative. Consequently, the theorem demonstrates that our
algorithm competes with one-step deviations in this case. This is true irrespective of whether $\pi^{\text{ref}}$ is in the policy class
$\Pi$ or not.

4. When the reference policy is very suboptimal, then the first term can be negative. In this case, the regret to one-step 
deviations can be large despite the guarantee of Theorem 3, since the first negative term allows the second term to 
be large while the sum stays bounded. However, when the first term is significantly negative, then the learned policy 
has already improved upon the reference policy substantially! This ability to improve upon a poor reference policy by 
using a mixture policy for rolling out is an important distinction for Algorithm 1 compared with previous approaches. 

Overall, Theorem 3 shows that the learned policy is either competitive with the reference policy and nearly locally optimal, 
or improves substantially upon the reference policy. 
3.3. Hardness of local optimality

In this section we demonstrate that the process of reaching a local optimum (under one-step deviations) can be exponentially 
slow when the initial starting policy is arbitrary. This reflects the hardness of learning to search problems when equipped 
with a poor reference policy, even if local rather than global optimality is considered a yardstick. We establish this lower 
bound for a class of algorithms substantially more powerful than LOLS. We start by defining a search space and a policy
class. Our search space consists of trajectories of length T, with 2 actions available at each step of the trajectory. We use 0 
and 1 to index the two actions. We consider policies whose only feature in a state is the depth of the state in the trajectory, 
meaning that the action taken by any policy tt in a state st depends only on t. Consequently, each policy can be indexed 
by a bit string of length T. For instance, the policy 0100 ... 0 executes action 0 in the first step of any trajectory, action 1 
in the second step and 0 at all other levels. It is easily seen that two policies are one-step deviations of each other if the 
corresponding bit strings have a Hamming distance of 1. 

To establish a lower bound, consider the following powerful algorithmic pattern. Given a current policy $\pi$, the algorithm
examines the cost $J(\pi')$ for all the one-step deviations $\pi'$ of $\pi$. It then chooses the policy with the smallest cost as its new
learned policy. Note that access to the actual costs $J(\pi)$ makes this algorithm more powerful than existing l2s algorithms,
which can only estimate costs of policies through roll-outs on individual examples. Suppose this algorithm starts from an
initial policy $\hat{\pi}_0$. How long does it take for the algorithm to reach a policy $\hat{\pi}_i$ which is locally optimal compared with all
its one-step deviations? We next present a lower bound for algorithms of this style.

Theorem 4. Consider any algorithm which updates policies only by moving from the current policy to a one-step deviation.
Then there is a search space, a policy class and a cost function where any such algorithm must make $\Omega(2^T)$ updates
before reaching a locally optimal policy. Specifically, the lower bound also applies to Algorithm 1.

The result shows that competing with the seemingly reasonable benchmark of one-step deviations may be very challenging 
from an algorithmic perspective, at least without assumptions on the search space, policy class, loss function, or starting 
policy. For instance, the construction used to prove Theorem 4 does not apply to Hamming loss. 
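The algorithmic pattern analyzed in this section can be sketched as a greedy local search over bit-string policies. The sketch assumes exact access to the cost of each policy; the cost function used in the example (the number of 1-actions, a decomposable Hamming-style cost) is a hypothetical easy case chosen for illustration and is not the construction behind Theorem 4.

```python
from typing import Callable, List, Tuple

BitPolicy = Tuple[int, ...]   # entry t is the action taken at depth t


def one_step_deviations(policy: BitPolicy) -> List[BitPolicy]:
    """All policies at Hamming distance 1 from `policy`."""
    return [
        policy[:t] + (1 - policy[t],) + policy[t + 1:]
        for t in range(len(policy))
    ]


def local_search(
    start: BitPolicy, cost: Callable[[BitPolicy], float]
) -> Tuple[BitPolicy, int]:
    """Move to the best one-step deviation until no deviation improves the cost.

    Returns the locally optimal policy and the number of updates performed.
    """
    current, updates = start, 0
    while True:
        best = min(one_step_deviations(current), key=cost)
        if cost(best) >= cost(current):
            return current, updates      # locally optimal
        current, updates = best, updates + 1


# Easy case: the cost is the number of 1-actions, so each update fixes one position.
final_policy, num_updates = local_search((1, 1, 1, 1), sum)
```

For this decomposable cost the search needs only T updates; the lower bound states that for an adversarially chosen cost function the number of updates can instead grow exponentially in T.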

4. Structured Contextual Bandit 

We now show that a variant of LOLS can be run in a “structured contextual bandit” setting, where only the loss of a single
structured label can be observed. As mentioned, this setting has applications to webpage layout, personalized search, and 
several other domains. 

At each round, the learner is given an input example $x$, makes a prediction $\hat{y}$ and suffers structured loss $\ell(y^*, \hat{y})$. We
assume that the structured losses lie in the interval $[0, 1]$, that the search space has depth $T$ and that there are at most $K$
actions available at each state. As before, the algorithm has access to a policy class $\Pi$, and also to a reference policy $\pi^{\text{ref}}$. It
is important to emphasize that the reference policy does not have access to the true label, and the goal is improving on the
reference policy.

Our approach is based on the $\epsilon$-greedy algorithm, which is a common strategy in partial feedback problems. Upon receiving
an example $x_i$, the algorithm randomly chooses whether to explore or exploit on this example. With probability $1 - \epsilon$, the
algorithm chooses to exploit and follows the recommendation of the current learned policy. With the remaining probability,
the algorithm performs a randomized variant of the LOLS update. A detailed description is given in Algorithm 2.

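We assess the algorithm’s performance via a measure of regret, where the comparator is a mixture of the reference policy
and the best one-step deviation. Let $\bar{\pi}_i$ be the averaged policy based on all policies in $\mathcal{I}$ at round $i$. $\hat{y}_{e_i}$ is the predicted label
in either step 9 or step 14 of Algorithm 2. The average regret is defined as:

$$\text{Regret} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbb{E}\left[\ell(y_i^*, \hat{y}_{e_i})\right] - \beta\, \mathbb{E}\left[\ell(y_i^*, \hat{y}_{e_i^{\text{ref}}})\right] - (1 - \beta) \sum_{t=1}^{T} \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\bar{\pi}_i}^{t}} \left[ Q^{\bar{\pi}_i}(s, \pi) \right] \right)$$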

Recalling our earlier definition of $\delta_i$ (4), we bound the regret of Algorithm 2, with a proof in the appendix.

Theorem 5. Algorithm 2 with parameter $\epsilon$ satisfies:

$$\text{Regret} \leq \epsilon + \frac{1}{N} \sum_{i=1}^{N} \delta_{n_i}.$$
Algorithm 2 Structured Contextual Bandit Learning

Require: Examples $\{x_i\}_{i=1}^{N}$, reference policy $\pi^{\text{ref}}$, exploration probability $\epsilon$ and mixture parameter $\beta \geq 0$.
1: Initialize a policy $\hat{\pi}_0$, and set $\mathcal{I} = \emptyset$.
2: for all $i = 1, 2, \ldots, N$ (loop over each instance) do
3:   Obtain the example $x_i$, set explore = 1 with probability $\epsilon$, set $n_i = |\mathcal{I}|$.
4:   if explore then
5:     Pick random time $t \in \{0, 1, \ldots, T - 1\}$.
6:     Roll-in by executing $\pi_i^{\text{in}} = \hat{\pi}_{n_i}$ for $t$ rounds and reach $s_t$.
7:     Pick random action $a_t \in A(s_t)$; let $K = |A(s_t)|$.
8:     Let $\pi_i^{\text{out}} = \pi^{\text{ref}}$ with probability $\beta$, otherwise $\hat{\pi}_{n_i}$.
9:     Roll-out with $\pi_i^{\text{out}}$ for $T - t - 1$ steps to evaluate $\hat{c}(a) = K \ell(e(a_t)) \mathbf{1}[a = a_t]$.
10:     Generate a feature vector $\Phi(x_i, s_t)$.
11:     $\hat{\pi}_{n_i + 1} \leftarrow \text{Train}(\hat{\pi}_{n_i}, \hat{c}, \Phi(x_i, s_t))$.
12:     Augment $\mathcal{I} = \mathcal{I} \cup \{\hat{\pi}_{n_i + 1}\}$.
13:   else
14:     Follow the trajectory of a policy $\pi$ drawn randomly from $\mathcal{I}$ to an end state $e$, predict the corresponding structured output.
15:   end if
16: end for
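The cost estimate in step 9 is an importance-weighted estimate: only the loss of the randomly chosen action is observed, and scaling it by K compensates for the uniform choice among K actions. The sketch below checks, by exact enumeration over the K choices, that its expectation equals the full cost vector. The loss values and the function name `bandit_cost_estimate` are hypothetical.

```python
from fractions import Fraction
from typing import Dict, Sequence


def bandit_cost_estimate(
    chosen: str, observed_loss: Fraction, actions: Sequence[str]
) -> Dict[str, Fraction]:
    """Estimated cost vector: K * loss on the chosen action, zero elsewhere."""
    K = len(actions)
    return {
        a: (K * observed_loss if a == chosen else Fraction(0))
        for a in actions
    }


# Hypothetical true roll-out losses for each action at some state (unknown to the learner).
true_loss = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(1)}
actions = list(true_loss)

# Exact expectation of the estimate when the action is drawn uniformly at random.
expected = {a: Fraction(0) for a in actions}
for chosen in actions:
    estimate = bandit_cost_estimate(chosen, true_loss[chosen], actions)
    for a in actions:
        expected[a] += estimate[a] / len(actions)
```

Exact rational arithmetic is used so that the equality between the expected estimate and the true losses can be verified without floating-point tolerance.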


With a no-regret learning algorithm, we expect

$$\delta_i \leq \delta_{\text{class}} + cK \sqrt{\frac{\log |\Pi|}{i}}, \tag{5}$$

where $|\Pi|$ is the cardinality of the policy class. This leads to the following corollary with a proof in the appendix.

Corollary 2. In the setup of Theorem 5, suppose further that the underlying no-regret learner satisfies (5). Then with 
probability at least 1 — 2/K'^T'^ log(A^|n|))^, 


Regret = O ^ 


5. Experiments

This section shows that LOLS is able to improve upon a suboptimal reference policy and provides empirical evidence to 
support the analysis in Section 3. We conducted experiments on the following three applications. 

Cost-sensitive multiclass classification. For each cost-sensitive multiclass sample, each choice of label has an associated
cost. The search space for this task is a binary search tree. The root of the tree corresponds to the whole set of labels. We
recursively split the set of labels in half, until each subset contains only one label. A trajectory through the search space is a
path from root to leaf in this tree. The loss of the end state is defined by the cost. An optimal reference policy can lead the
agent to the end state with the minimal cost. We also show results of using a bad reference policy which arbitrarily chooses
an action at each state. The experiments are conducted on the KDDCup 99 dataset^, generated from a computer network
intrusion detection task. The dataset contains 5 classes, 4,898,431 training and 311,029 test instances.
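The halving search space described above can be sketched as follows. This is an illustrative sketch under assumptions: the names are hypothetical, action 0 means "go to the left half" and action 1 "go to the right half", and the left half takes the extra label when the set has odd size (the paper does not specify this tie-breaking).

```python
def path_to_label(labels, target):
    """Return the root-to-leaf action sequence (0 = left, 1 = right) reaching `target`."""
    actions = []
    current = list(labels)
    while len(current) > 1:
        mid = (len(current) + 1) // 2  # assumption: left half gets the extra label
        left, right = current[:mid], current[mid:]
        if target in left:
            actions.append(0)
            current = left
        else:
            actions.append(1)
            current = right
    if current[0] != target:
        raise ValueError("target is not one of the labels")
    return actions


def follow(labels, actions):
    """Apply a sequence of left/right actions and return the remaining label subset."""
    current = list(labels)
    for a in actions:
        mid = (len(current) + 1) // 2
        current = current[:mid] if a == 0 else current[mid:]
    return current


def optimal_reference_path(labels, costs):
    """An optimal reference policy heads for the label with minimal cost."""
    best = min(labels, key=lambda y: costs[y])
    return path_to_label(labels, best)
```

For the five classes of the dataset, every trajectory has length two or three, so each prediction is a short sequence of binary decisions.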

Part of speech tagging. The search space for POS tagging is left-to-right prediction. Under Hamming loss the trivial
optimal reference policy simply chooses the correct part of speech for each word. We train on 38k sentences and test on
11k from the Penn Treebank (Marcus et al., 1993). One can construct suboptimal or even bad reference policies, but under
Hamming loss these are all equivalent to the optimal policy because roll-outs by any fixed policy will incur exactly the
same loss and the learner can immediately learn from one-step deviations.
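The Hamming-loss argument can be checked on a toy sentence. The sketch below is an illustration under assumptions: the tag set, the sentence, and the helper names are hypothetical, and a "fixed" roll-out policy is taken to mean one whose predictions do not depend on the tag chosen at the deviation point. It shows that the relative cost of each one-step deviation is the same whether we roll out with the optimal reference or with a bad constant policy.

```python
def hamming_loss(predicted, gold):
    """Number of positions where the predicted tag differs from the gold tag."""
    return sum(p != g for p, g in zip(predicted, gold))


def one_step_costs(prefix, tags, gold, rollout_policy):
    """Cost of each candidate tag at position len(prefix), completed by rollout_policy."""
    costs = {}
    for tag in tags:
        seq = list(prefix) + [tag]
        while len(seq) < len(gold):
            seq.append(rollout_policy(len(seq), gold))
        costs[tag] = hamming_loss(seq, gold)
    return costs


def relative(costs):
    """Subtract the minimum so only cost differences between actions remain."""
    low = min(costs.values())
    return {tag: c - low for tag, c in costs.items()}


def reference_policy(position, gold):
    return gold[position]  # optimal under Hamming loss: emit the gold tag


def constant_policy(position, gold):
    return "N"  # a bad fixed policy: always predict the same tag
```

Both roll-out policies rank the candidate tags identically, because the roll-out contributes the same constant to every candidate.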

^ http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html













Learning to Search Better than Your Teacher 


                         roll-out →
↓ roll-in                Reference   Mixture   Learned
Reference is optimal
  Reference              0.282       0.282     0.279
  Learned                0.267       0.266     0.266
Reference is bad
  Reference              1.670       1.664     0.316
  Learned                0.266       0.266     0.266

Table 2. The average cost on cost-sensitive classification dataset; columns are roll-out and rows are roll-in. The best result is bold.
Searn achieves 0.281 and 0.282 when the reference policy is optimal and bad, respectively. LOLS is Learned/Mixture and
highlighted in green.


                         roll-out →
↓ roll-in                Reference   Mixture   Learned
Reference is optimal
  Reference              95.58       94.12     94.10
  Learned                95.61       94.13     94.10

Table 3. The accuracy on POS tagging; columns are roll-out and rows are roll-in. The best result is bold. Searn achieves 94.88. LOLS
is Learned/Mixture and highlighted in green.


Dependency parsing. A dependency parser learns to generate a tree structure describing the syntactic dependencies
between words in a sentence (McDonald et al., 2005; Nivre, 2003). We implemented a hybrid transition system (Kuhlmann
et al., 2011) which parses a sentence from left to right with three actions: Shift, ReduceLeft and ReduceRight. We
used the “non-deterministic oracle” (Goldberg & Nivre, 2013) as the optimal reference policy, which leads the agent to
the best end state reachable from each state. We also designed two suboptimal reference policies. A bad reference policy
chooses an arbitrary legal action at each state. A suboptimal policy applies a greedy selection and chooses the action which
leads to a good tree when it is obvious; otherwise, it arbitrarily chooses a legal action. (This suboptimal reference was the
default reference policy used prior to the work on “non-deterministic oracles.”) We used data from the Penn Treebank Wall
Street Journal corpus: the standard data split for training (sections 02-21) and test (section 23). The loss is evaluated in
UAS (unlabeled attachment score), which measures the fraction of words that pick the correct parent.
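The three actions and the UAS metric can be sketched as follows. This is a simplified illustration, not the parser used in the experiments: the action names as strings, the artificial root at index 0 placed at the bottom of the stack, and the absence of legality checks are all assumptions made for brevity.

```python
def run_transitions(num_words, actions):
    """Apply hybrid-system actions to words 1..num_words; return each word's head (0 = root)."""
    stack = [0]  # assumption: artificial root sits at the bottom of the stack
    buffer = list(range(1, num_words + 1))
    heads = {}
    for action in actions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "REDUCE_LEFT":
            # Pop the stack top and attach it to the next input word.
            child = stack.pop()
            heads[child] = buffer[0]
        elif action == "REDUCE_RIGHT":
            # Pop the stack top and attach it to the word below it on the stack.
            child = stack.pop()
            heads[child] = stack[-1]
        else:
            raise ValueError("unknown action: " + action)
    return [heads.get(word) for word in range(1, num_words + 1)]


def uas(predicted_heads, gold_heads):
    """Unlabeled attachment score: fraction of words with the correct parent."""
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)
```

For a three-word sentence such as "She eats fish" with gold heads `[2, 0, 2]`, the sequence Shift, ReduceLeft, Shift, Shift, ReduceRight, ReduceRight recovers the gold tree.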

For each task and each reference policy, we compare 6 different combinations of roll-in (learned or reference) and roll-out
(learned, mixture or reference) strategies. We also include Searn in the comparison, since it has notable differences from
LOLS. Searn rolls in and out with a mixture where a different policy is drawn for each state, while LOLS draws a
policy once per example. Searn uses a batch learner, while LOLS uses an online learner. The policy in Searn is a mixture over the
policies produced at each iteration. For LOLS, it suffices to keep just the most recent one. It is an open research question
whether a theoretical guarantee analogous to Theorem 3 can be established for Searn.
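The difference between the two mixing schemes can be sketched as follows. This is only an illustration of the sampling pattern described above; the function names are hypothetical, and the sketch ignores everything else that distinguishes the two algorithms.

```python
import random


def per_example_mixture(num_steps, beta, rng):
    """LOLS-style: draw reference vs. learned once and use it for the whole example."""
    choice = "reference" if rng.random() < beta else "learned"
    return [choice] * num_steps


def per_state_mixture(num_steps, beta, rng):
    """Searn-style: draw reference vs. learned independently at every state."""
    return ["reference" if rng.random() < beta else "learned"
            for _ in range(num_steps)]
```

With the per-example scheme every step of a trajectory is governed by a single policy, whereas the per-state scheme can switch policies in the middle of a trajectory.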

Our implementation is based on Vowpal Wabbit^, a machine learning system that supports online learning and l2s. For
LOLS’s mixture policy, we set $\beta = 0.5$. We found that LOLS is not sensitive to $\beta$, and setting $\beta$ to be 0.5 works well in
practice. For Searn, we set the mixture parameter to be $1 - (1 - \alpha)^t$, where $t$ is the number of rounds and $\alpha = 10^{-5}$.
Unless stated otherwise, all the learners take 5 passes over the data.

Tables 2, 3 and 4 show the results on cost-sensitive multiclass classification, POS tagging and dependency parsing,
respectively. The empirical results qualitatively agree with the theory. Rolling in with reference is always bad. When the
reference policy is optimal, then doing roll-outs with reference is a good idea. However, when the reference policy is
suboptimal or bad, then rolling out with reference is a bad idea, and mixture roll-outs perform substantially better. LOLS
also significantly outperforms Searn on all tasks.

^ http://hunch.net/~vw/

                         roll-out →
↓ roll-in                Reference   Mixture   Learned
Reference is optimal
  Reference              87.2        89.7      88.2
  Learned                90.7        90.5      86.9
Reference is suboptimal
  Reference              83.3        87.2      81.6
  Learned                87.1        90.2      86.8
Reference is bad
  Reference              68.7        65.4      66.7
  Learned                75.8        89.4      87.5

Table 4. The UAS score on dependency parsing data set; columns are roll-out and rows are roll-in. The best result is bold. Searn
achieves 84.0, 81.1, and 63.4 when the reference policy is optimal, suboptimal, and bad, respectively. LOLS is Learned/Mixture
and highlighted in green.


6. Proofs of Main Results 

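Lemma 1 (Ross & Bagnell, Lemma 4.3). For any two policies $\pi_1, \pi_2$:

$$J(\pi_1) - J(\pi_2) = T\,\mathbb{E}_{t\sim U(1,T),\, s\sim d^t_{\pi_1}}\left[Q^{\pi_2}(s,\pi_1) - Q^{\pi_2}(s,\pi_2)\right] = T\,\mathbb{E}_{t\sim U(1,T),\, s\sim d^t_{\pi_2}}\left[Q^{\pi_1}(s,\pi_1) - Q^{\pi_1}(s,\pi_2)\right].$$

Proof. Let $\pi^t$ be a policy that executes $\pi_1$ in the first $t$ steps and then executes $\pi_2$ from time steps $t+1$ to $T$. We have
$J(\pi_1) = J(\pi^T)$ and $J(\pi_2) = J(\pi^0)$. Consequently, we can set up the telescoping sum:

$$J(\pi_1) - J(\pi_2) = \sum_{t=1}^{T}\left[J(\pi^t) - J(\pi^{t-1})\right] = \sum_{t=1}^{T}\mathbb{E}_{s_t\sim d^t_{\pi_1}}\left[Q^{\pi_2}(s_t,\pi_1) - Q^{\pi_2}(s_t,\pi_2)\right]$$

$$= T\,\mathbb{E}_{t\sim U(1,T),\, s\sim d^t_{\pi_1}}\left[Q^{\pi_2}(s,\pi_1) - Q^{\pi_2}(s,\pi_2)\right].$$

The second equality in the lemma can be obtained by reversing the roles of $\pi_1$ and $\pi_2$ above. □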
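The telescoping argument can be verified numerically on a small deterministic search space. The sketch below is an illustration under assumptions: states are tuples of past binary actions, policies are deterministic lookup tables, the loss is an arbitrary table over complete trajectories, and all names are hypothetical. It checks the sum form of the identity; multiplying the uniform average over $t$ by $T$ gives the same quantity.

```python
import itertools
import random


def rollout(state, policy, horizon):
    """Complete a trajectory from `state` by following `policy` until the horizon."""
    while len(state) < horizon:
        state = state + (policy[state],)
    return state


def cost(policy, loss, horizon):
    """J(policy): loss of the end state reached from the initial state."""
    return loss[rollout((), policy, horizon)]


def q_value(state, action, policy, loss, horizon):
    """Q^policy(state, action): take `action`, then follow `policy` to the end."""
    return loss[rollout(state + (action,), policy, horizon)]


def telescoped_difference(pi1, pi2, loss, horizon):
    """Sum over t of Q^{pi2}(s_t, pi1) - Q^{pi2}(s_t, pi2), with s_t visited by pi1."""
    total, state = 0.0, ()
    for _ in range(horizon):
        total += (q_value(state, pi1[state], pi2, loss, horizon)
                  - q_value(state, pi2[state], pi2, loss, horizon))
        state = state + (pi1[state],)
    return total


def random_policy(horizon, rng):
    """A deterministic policy: one binary action for every non-terminal state."""
    return {s: rng.randrange(2)
            for t in range(horizon)
            for s in itertools.product((0, 1), repeat=t)}


def random_loss(horizon, rng):
    return {s: rng.random() for s in itertools.product((0, 1), repeat=horizon)}
```

For any pair of policies drawn this way, `cost(pi1, ...) - cost(pi2, ...)` agrees with `telescoped_difference(pi1, pi2, ...)` up to floating-point error.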

6.1. Proof of Theorem 3 

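We start with an application of Lemma 1. Using the lemma, we have:

$$J(\bar\pi) - J(\pi^{\text{ref}}) = \frac{1}{N}\sum_{i=1}^{N} T\,\mathbb{E}_{t\sim U(1,T),\, s\sim d^t_{\hat\pi_i}}\left[Q^{\pi^{\text{ref}}}(s,\hat\pi_i) - Q^{\pi^{\text{ref}}}(s,\pi^{\text{ref}})\right]. \qquad (6)$$

We also observe that

$$\sum_{t=1}^{T}\left(J(\bar\pi) - \min_{\pi\in\Pi}\mathbb{E}_{s\sim d^t_{\bar\pi}}\left[Q^{\bar\pi}(s,\pi)\right]\right) \le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(J(\hat\pi_i) - \min_{\pi\in\Pi}\mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[Q^{\hat\pi_i}(s,\pi)\right]\right)$$

$$\le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[Q^{\hat\pi_i}(s,\hat\pi_i) - \min_{a} Q^{\hat\pi_i}(s,a)\right]. \qquad (7)$$

Combining the above bounds from Equations 6 and 7, we see that

$$\beta\left(J(\bar\pi) - J(\pi^{\text{ref}})\right) + (1-\beta)\sum_{t=1}^{T}\left(J(\bar\pi) - \min_{\pi\in\Pi}\mathbb{E}_{s\sim d^t_{\bar\pi}}\left[Q^{\bar\pi}(s,\pi)\right]\right)$$

$$\le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[\beta\left(Q^{\pi^{\text{ref}}}(s,\hat\pi_i) - Q^{\pi^{\text{ref}}}(s,\pi^{\text{ref}})\right) + (1-\beta)\left(Q^{\hat\pi_i}(s,\hat\pi_i) - \min_{a} Q^{\hat\pi_i}(s,a)\right)\right]$$

$$= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[Q^{\pi_i^{\text{out}}}(s,\hat\pi_i) - \beta\,Q^{\pi^{\text{ref}}}(s,\pi^{\text{ref}}) - (1-\beta)\min_{a} Q^{\hat\pi_i}(s,a)\right]$$

$$\le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[Q^{\pi_i^{\text{out}}}(s,\hat\pi_i) - \beta\min_{a} Q^{\pi^{\text{ref}}}(s,a) - (1-\beta)\min_{a} Q^{\hat\pi_i}(s,a)\right],$$

where $Q^{\pi_i^{\text{out}}} = \beta\,Q^{\pi^{\text{ref}}} + (1-\beta)\,Q^{\hat\pi_i}$ is the cost-to-go under the mixture roll-out policy.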


6.2. Proof of Corollary 1 

The proof is fairly straightforward from definitions. By definition of no-regret, it is immediate that the gap

$$\sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}\left[c_{i,t}(\hat\pi_i(s_t)) - c_{i,t}(\pi(s_t))\right] = o(NT), \qquad (8)$$

for all policies $\pi \in \Pi$, where we recall that $c_{i,t}$ is the cost vector over the actions on round $i$ when we do roll-outs from the
$t$-th decision point. Let $\mathbb{E}_i$ denote the conditional expectation on round $i$, conditioned on the previous rounds in Algorithm 1.
Then it is easily seen that

$$\mathbb{E}_i[c_{i,t}(a)] = \mathbb{E}_i\left[\ell(e_{i,t}(a)) - \min_{a'}\ell(e_{i,t}(a'))\right],$$

with $e_{i,t}$ being the end state reached on completing the roll-out with the policy $\pi_i^{\text{out}}$ on round $i$, when action $a$ was taken
on the decision point $t$. Recalling that we rolled in following the trajectory of $\pi_i^{\text{in}}$, this expectation further simplifies to

$$\mathbb{E}_i[c_{i,t}(a)] = \mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[Q^{\pi_i^{\text{out}}}(s,a)\right] - \mathbb{E}_i\left[\min_{a'}\ell(e_{i,t}(a'))\right].$$

Now taking expectations in Equation 8 and combining with the above observation, we obtain that for any policy $\pi \in \Pi$,

$$\sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}\left[c_{i,t}(\hat\pi_i(s_t)) - c_{i,t}(\pi(s_t))\right] = \sum_{i=1}^{N}\sum_{t=1}^{T}\mathbb{E}_{s\sim d^t_{\hat\pi_i}}\left[Q^{\pi_i^{\text{out}}}(s,\hat\pi_i(s)) - Q^{\pi_i^{\text{out}}}(s,\pi(s))\right] = o(NT).$$

Taking the best policy $\pi \in \Pi$ and dividing through by $NT$ completes the proof.


6.3. Proof sketch of Theorem 5 

(Sketch only) We decompose the analysis over exploration and exploitation rounds. For the exploration rounds, we bound
the regret by its maximum possible value of 1. To control the regret on the exploitation rounds, we focus on the updates
performed during exploration.

The cost vector $c(a)$ used at an exploration round $i$ satisfies

$$\mathbb{E}_i[c(a)] = \mathbb{E}_i\left[K\,\ell(e(a_t))\,\mathbf{1}[a = a_t]\right] = \mathbb{E}_{t\sim U(0:T-1),\, s\sim d^t_{\hat\pi_{n_i}}}\left[Q^{\pi_i^{\text{out}}}(s,a)\right].$$

Corollary 1. Since the cost vector is identical in expectation to that used in Algorithm 1, the proof of Theorem 3, which only
depends on expectations, can be reused to prove a result similar to Theorem 3 for the exploration rounds. That is, letting $\bar\pi_i$
be the averaged policy over all the policies in $\mathcal{I}$ at exploration round $i$, we have the bound

$$\beta\left(J(\bar\pi_i) - J(\pi^{\text{ref}})\right) + (1-\beta)\sum_{t=1}^{T}\left(J(\bar\pi_i) - \min_{\pi\in\Pi}\mathbb{E}_{s\sim d^t_{\bar\pi_i}}\left[Q^{\bar\pi_i}(s,\pi)\right]\right) \le T\delta_i,$$

where $\delta_i$ is as defined in Equation 4.

On the exploitation rounds, we can now invoke this guarantee. Recalling that we have $n_i$ exploration rounds until round $i$,
the expected regret at an exploitation round $i$ is at most $\delta_{n_i}$. Thus the overall regret of the algorithm is at most

$$\text{Regret} \le \epsilon + \frac{1}{N}\sum_{i=1}^{N}\delta_{n_i},$$

which completes the proof.


6.4. Proof of Corollary 2

We start by substituting Equation 5 in the regret bound of Theorem 5. This yields

$$\text{Regret} \le \epsilon + \delta_{\text{class}} + \frac{cKT}{N}\sum_{i=1}^{N}\sqrt{\frac{\log|\Pi|}{n_i}}.$$

We would like to further replace $n_i$ with its expectation, which is $\epsilon i$. However, this does not yield a valid upper bound
directly. Instead, we apply a Chernoff bound to the quantity $n_i$, which is a sum of $i$ i.i.d. Bernoulli random variables with
mean $\epsilon$. Consequently, we have

$$\mathbb{P}\left(n_i \le (1-\gamma)\epsilon i\right) \le \exp\left(-\gamma^2\epsilon i/2\right) \le \exp(-\epsilon i/8),$$

for $\gamma = 1/2$. Let $i_0 = 16\log N/\epsilon + 1$. Then we can sum the failure probabilities above for all $i \ge i_0$ and obtain

$$\sum_{i=i_0}^{N}\mathbb{P}\left(n_i \le \epsilon i/2\right) \le \sum_{i=i_0}^{N}\exp(-\epsilon i/8) \le \sum_{i=i_0}^{\infty}\exp(-\epsilon i/8) = \frac{\exp(-\epsilon i_0/8)}{1 - \exp(-\epsilon/8)} = \frac{\exp(-2\log N)}{\exp(\epsilon/8) - 1} \le \frac{8}{N^2\epsilon},$$

where the last inequality uses $1 + x \le \exp(x)$. Consequently, we can now allow a regret of 1 on the first $i_0$ rounds, and
control the regret on the remaining rounds using $n_i \ge \epsilon i/2$. Doing so, we see that with probability at least $1 - 2/(N^2\epsilon)$


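$$\text{Regret} \le \epsilon + \delta_{\text{class}} + \frac{i_0}{N} + \frac{cKT}{N}\sum_{i=1}^{N}\sqrt{\frac{2\log|\Pi|}{\epsilon i}} \le \epsilon + \delta_{\text{class}} + \frac{16\log N + \epsilon}{\epsilon N} + cKT\sqrt{\frac{8\log|\Pi|}{\epsilon N}}.$$

Choosing $\epsilon = (KT)^{2/3}\left(\log(N|\Pi|)/N\right)^{1/3}$ completes the proof.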
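The quantities in this proof are easy to evaluate numerically. The sketch below is illustrative only: the function names and the example values of $K$, $T$, $N$ and $|\Pi|$ are assumptions. It computes the exploration rate chosen at the end of the proof, the burn-in length $i_0$, and compares the geometric tail of the Chernoff failure probabilities with the closed-form bound $8/(N^2\epsilon)$.

```python
import math


def exploration_rate(K, T, N, num_policies):
    """epsilon = (KT)^(2/3) * (log(N * |Pi|) / N)^(1/3)."""
    return (K * T) ** (2.0 / 3.0) * (math.log(N * num_policies) / N) ** (1.0 / 3.0)


def burn_in_rounds(N, eps):
    """i_0 = 16 log(N) / eps + 1, the rounds on which a regret of 1 is allowed."""
    return 16.0 * math.log(N) / eps + 1.0


def failure_tail(N, eps):
    """Geometric tail sum_{i >= i_0} exp(-eps * i / 8), in closed form."""
    i0 = burn_in_rounds(N, eps)
    return math.exp(-eps * i0 / 8.0) / (1.0 - math.exp(-eps / 8.0))


def failure_bound(N, eps):
    """The bound 8 / (N^2 * eps) obtained from 1 + x <= exp(x)."""
    return 8.0 / (N ** 2 * eps)
```

As $N$ grows with $K$, $T$ and $|\Pi|$ fixed, the chosen exploration rate shrinks, so a vanishing fraction of rounds is spent exploring.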


6.5. Proof of Theorem 4 

The proof follows from results in combinatorics. The dynamics of algorithms considered here can be thought of as a path
through a graph where the vertices are the corners of the Boolean hypercube in $T$ dimensions, with two vertices at Hamming
distance 1 sharing an edge. We demonstrate that there is a cost function such that the algorithm is forced to traverse a long
path before reaching a local optimum. Without loss of generality, assume that the algorithm always moves to a one-step
deviation with the lowest cost, since otherwise longer paths exist.

Figure 4. Pictorial illustration of the proof elements of Theorem 4. Panel (a) depicts the actions chosen by policy 000. Selected action in
each state is indicated in bold. Panels (b) through (d) depict various stages as the algorithm updates the policy to its one-step deviations,
starting from the policy 000. Each policy that the algorithm selects is depicted by a shaded circle, with the arrows marking the moves
of the algorithm. Current policy is the shaded circle with a bold boundary. Dashed lines denote the potential one-step deviations that the
algorithm can move to and crossed policies are those which have higher costs than the current policy (see text for details).

To gain some intuition, first consider $T = 3$, which is depicted in Figure 4. Suppose the algorithm starts from the policy
000 and then moves to the policy 001. If the algorithm picks the best amongst the one-step deviations, we know that $J(001) <
\min\{J(000), J(010), J(100)\}$, placing constraints on the costs of these policies which force the algorithm to not visit any
of these policies later. Similarly, if the algorithm moves to the policy 011 next, we obtain a further constraint $J(011) <
\min\{J(101), J(001)\}$. It is easy to check that the only feasible move (corresponding to policies not crossed in Figure 4(c))
which decreases the cost under these constraints is to the policy 111 and then 110, at which point the algorithm attains
local optimality since no more moves that decrease the cost are possible. In general, at any step $i$ of the path, the policy
$\pi_i$ is a one-step deviation of $\pi_{i-1}$ and at least 2 steps away from $\pi_j$ for $j < i - 1$. The policy never moves to a
neighbor of an ancestor (excluding the immediate parent) in the path.

This property is the key element to understand more generally. Suppose we have a current path $\pi_1 \to \pi_2 \to \cdots \to \pi_{i-1} \to \pi_i$.
Since we picked the best neighbor of $\pi_j$ as $\pi_{j+1}$, $\pi_{i+1}$ cannot be a neighbor of any $\pi_j$ for $j < i$. Consequently, the
maximum number of updates the algorithm must make is given by the length of the longest such path on a hypercube,
where each vertex (other than start and end) neighbors exactly two other vertices on the path. This is called the
snake-in-the-box problem in combinatorics, and arises in the study of error correcting codes. It is shown by Abbott & Katchalski
(1988) that the length of the longest such path is $\Theta(2^T)$. With monotonically decreasing costs for policies in the path and
maximal cost for all policies not in the path, the traversal time is $\Theta(2^T)$.
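The path property used above can be checked mechanically for the $T = 3$ example. The sketch below is a small illustration with hypothetical function names: it tests whether a sequence of hypercube vertices is a "snake", meaning consecutive vertices differ in exactly one coordinate and non-consecutive vertices are at Hamming distance at least 2.

```python
def hamming(u, v):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(u, v))


def is_snake(path):
    """True if consecutive vertices are neighbors and no other pair of vertices is."""
    for i, u in enumerate(path):
        for j in range(i + 1, len(path)):
            distance = hamming(u, path[j])
            if j == i + 1 and distance != 1:
                return False  # consecutive policies must be one-step deviations
            if j > i + 1 and distance < 2:
                return False  # the path may not return next to an earlier policy
    return True
```

The path 000, 001, 011, 111, 110 from the example passes this check, while a path that returns to a neighbor of its starting point does not.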

Finally, it might appear that Algorithm 1 is capable of moving to policies which are not just one-step deviations of the
currently learned policy, since it performs updates on “mini-batches” of $T$ cost-sensitive examples. However, on this lower
bound instance, Algorithm 1 will be forced to follow one-step deviations only, due to the structure of the cost function. For
instance, from the policy 000, when we assign maximal cost to policies 010 and 100 in our example, this corresponds to
making the cost of taking action 1 on the first and second step very large in the induced cost-sensitive problem. Consequently,
001 is the policy which minimizes the cost-sensitive loss even when all the $T$ roll-outs are accumulated, implying the
algorithm is forced to traverse the same long path to local optimality.

Algorithm 3 Cost-Sensitive One Against All (CSOAA) Algorithm

Require: Initial predictor $f_1(x)$

1: for all $t = 1, 2, \ldots, T$ do
2:   Observe $\{x_{t,i}\}_{i=1}^{K}$.
3:   Predict class $i_t = \arg\min_{i=1,\ldots,K} f_t(x_{t,i})$.
4:   Observe costs $\{c_{t,i}\}_{i=1}^{K}$.
5:   Update $f_t$ using online least-squares regression on data $\{x_{t,i}, c_{t,i}\}_{i=1}^{K}$.
6: end for


A. Details of cost-sensitive reduction 

In this section we present the details of the reduction to cost-sensitive multiclass classification used in our experimental
evaluation. The experiments used the Cost-Sensitive One Against All (CSOAA) classification technique, the pseudocode
for which is presented in Algorithm 3. In words, the algorithm takes as input a feature vector $x_{t,i}$ for class $i$ at round $t$.
It then trains a regressor to predict the corresponding costs $c_{t,i}$ given the features. Given a fresh example, the predicted
label is the one with the smallest predicted cost. This is a natural extension of the One Against All (OAA) approach for
multiclass classification to cost-sensitive settings. Note that this also covers the alternative approach of having a common
feature vector, $x_{t,i} = z_t$ for all $i$, and instead training $K$ different cost predictors, one for each class. If $z_t \in \mathbb{R}^d$, one can
simply create $x_{t,i} \in \mathbb{R}^{dK}$ with $x_{t,i} = z_t$ in the $i$-th block and zero elsewhere. Learning a common predictor $f$ on $x$ is now
representationally equivalent to learning $K$ separate predictors, one for each class.

There is one missing detail in the specification of Algorithm 3, which is the update step. The specifics of this step depend
on the form of the function $f(x)$ being used. For instance, if $f(x) = w^{\top}x$, then a simple update rule is to use online
ridge regression (see e.g. Section 11.7 in (Cesa-Bianchi & Lugosi, 2006)). Online gradient descent (Zinkevich, 2003)
on the squared loss $\sum_{i=1}^{K}(f(x_{t,i}) - c_{t,i})^2$ is another simple alternative, which can be used more generally. The specific
implementation in our experiments uses a more sophisticated variant of online gradient descent with linear functions.
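A minimal version of CSOAA with a linear predictor and plain online gradient descent can be sketched as follows. This is an illustrative sketch, not the Vowpal Wabbit implementation used in the experiments: the class name, the fixed learning rate, and the toy data are assumptions, and the factor 2 from differentiating the squared loss is absorbed into the learning rate.

```python
def block_features(z, label, num_labels):
    """Place the common feature vector z in the block of `label`, zeros elsewhere."""
    d = len(z)
    x = [0.0] * (d * num_labels)
    x[label * d:(label + 1) * d] = z
    return x


class LinearCSOAA:
    """Cost-sensitive one-against-all with a shared linear cost regressor."""

    def __init__(self, dim, num_labels, learning_rate=0.1):
        self.num_labels = num_labels
        self.learning_rate = learning_rate
        self.w = [0.0] * (dim * num_labels)

    def predicted_cost(self, z, label):
        x = block_features(z, label, self.num_labels)
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, z):
        """Return the label with the smallest predicted cost."""
        return min(range(self.num_labels), key=lambda i: self.predicted_cost(z, i))

    def update(self, z, costs):
        """One gradient step on the squared error for every label's cost."""
        for label, cost in enumerate(costs):
            x = block_features(z, label, self.num_labels)
            error = self.predicted_cost(z, label) - cost
            self.w = [wi - self.learning_rate * error * xi
                      for wi, xi in zip(self.w, x)]
```

On a toy problem where each of two inputs has a different cheapest label, a few hundred updates suffice for the predicted label to match the cheapest one.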

B. Details of Experiments 

Our implementation is based on Vowpal Wabbit (VW) version 7.8 (http://hunch.net/~vw/). It is available
at https://github.com/KaiWeiChang/vowpal_wabbit/tree/icmlexp. For LOLS, we use the flags
“--search_rollin”, “--search_rollout”, and “--search_beta” to set the roll-in policy, the roll-out policy, and $\beta$, respectively. We use
“--search_interpolation policy --search_passes_per_policy --passes 5” to enable Searn. The detailed settings of various VW
flags for the three experiments are shown below:

• POS tagging: we use “--search_task sequence --search 45 --holdout_off --affix -2w,+2w --search_neighbor_features
-1:w,1:w -b 28”

• Dependency parsing: we use “--search_task dep_parser --search 12 --holdout_off --search_history_length 3
--search_no_caching -b 24 --root_label 8 --num_label 12”

• Cost-sensitive multiclass: we use “--search_task multiclasstask --search 5 --holdout_off --mc_cost”


The data sets used in the experiments are available upon request. 







